Automatic Construction of a Chinese Electronic Dictionary
Abstract
In this paper, an unsupervised approach for constructing a large-scale Chinese electronic dictionary is surveyed. The main purpose is to enable cheap and quick acquisition of a large-scale dictionary from a large untagged text corpus with the aid of the information in a small tagged seed corpus. The basic model is based on a Viterbi reestimation technique. During dictionary construction, the system tries to optimize automatic segmentation and tagging by repeatedly refining the parameters of the underlying language model; the refined parameters are then used to obtain a still better tagging result. In addition, a two-class classifier, which classifies an n-gram as either a word or a non-word, is used in combination with the Viterbi training module to improve system performance. Two system configurations were developed to construct the dictionary: (1) a Viterbi word identification module followed by a Viterbi POS tagging module, and (2) a two-class classification module acting as a postfilter for the above Viterbi word identification module. With a seed corpus of 1,000 sentences and an untagged corpus of 311,591 sentences, bigram word identification reaches 56.88% precision and 77.37% recall when the two-class classifier is applied to the word list suggested by the Viterbi word identification module. With a seed corpus of 9,676 sentences, the Viterbi part-of-speech tag reestimation stage yields weighted precision rates of 71.16% and 71.81% and weighted recall rates of 73.42% and 73.83% for the two configurations.
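As a rough illustration of the Viterbi reestimation idea described in the abstract, the Python sketch below segments untagged text with the current model, re-counts the words in the best segmentations, and repeats. It is a simplified sketch under stated assumptions, not the paper's system: it assumes a plain unigram word model, and the names `make_logp`, `viterbi_segment`, `viterbi_train`, and `precision_recall`, together with the constants `MAX_WORD_LEN` and `UNSEEN_CHAR_LOGP`, are hypothetical. The POS tagging stage and the two-class word/non-word classifier are not reproduced.

```python
import math
from collections import Counter

MAX_WORD_LEN = 4                    # assumed cap on candidate word length (characters)
UNSEEN_CHAR_LOGP = math.log(1e-6)   # assumed per-character penalty for unseen candidates


def make_logp(counts):
    """Build a unigram log-probability function from word counts."""
    total = sum(counts.values())

    def logp(word):
        if counts[word] > 0:
            return math.log(counts[word] / total)
        return UNSEEN_CHAR_LOGP * len(word)  # crude stand-in for smoothing

    return logp


def viterbi_segment(sentence, logp):
    """Return the highest-scoring segmentation of an unsegmented sentence."""
    n = len(sentence)
    best = [0.0] + [-math.inf] * n   # best[i]: best score for sentence[:i]
    back = [0] * (n + 1)             # back[i]: start index of the last word
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            score = best[start] + logp(sentence[start:end])
            if score > best[end]:
                best[end], back[end] = score, start
    words, end = [], n
    while end > 0:                   # follow back-pointers from the end
        words.append(sentence[back[end]:end])
        end = back[end]
    return words[::-1]


def viterbi_train(seed_sentences, raw_sentences, iterations=3):
    """Alternate between segmenting the raw text and re-counting words."""
    seed_counts = Counter(w for sent in seed_sentences for w in sent)
    counts = Counter(seed_counts)
    for _ in range(iterations):
        logp = make_logp(counts)
        counts = Counter(seed_counts)            # seed counts are always kept
        for sentence in raw_sentences:
            counts.update(viterbi_segment(sentence, logp))
    return counts


def precision_recall(proposed, reference):
    """Precision and recall of a proposed word list against a reference list."""
    proposed, reference = set(proposed), set(reference)
    correct = len(proposed & reference)
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall


# Toy data (invented for illustration only).
seed = [["我", "喜欢", "北京"], ["北京", "大学"]]
raw = ["我喜欢北京大学", "北京大学"]
counts = viterbi_train(seed, raw)
print(viterbi_segment("我喜欢北京大学", make_logp(counts)))
```

In this sketch the seed counts are re-added at every iteration so that the small tagged corpus keeps anchoring the model; whether the original system weights seed and untagged data this way is not stated in the abstract.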
Similar resources
Automatic Construction of Persian ICT WordNet using Princeton WordNet
WordNet is a large lexical database of the English language in which nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets). Each synset expresses a distinct concept. Synsets are interlinked by both semantic and lexical relations. WordNet is essentially used for word sense disambiguation, information retrieval, and text translation. In this paper, we propose s...
Stochastic Language Models for Automatic Acquisition of Lexicons from Printed Bilingual Dictionaries
Electronic bilingual lexicons are crucial for machine translation, cross-lingual information retrieval and speech recognition. For low-density languages, however, the availability of electronic bilingual lexicons is questionable. One solution is to acquire electronic lexicons from printed bilingual dictionaries. While manual data entry is a possibility, automatic acquisition of lexicons from sc...
Creating a Reusable English-Chinese Parallel Corpus for Bilingual Dictionary Construction
This paper first describes an experiment to construct an English-Chinese parallel corpus, then applies the Uplug word alignment tool to the corpus, and finally produces and evaluates an English-Chinese word list. The Stockholm English-Chinese Parallel Corpus (SEC) was created by downloading English-Chinese parallel corpora from a Chinese web site containing law texts that have been manually trans...
Automatic Construction of a Japanese-Chinese Dictionary via English
This paper proposes a method of constructing a dictionary for a pair of languages from bilingual dictionaries between each of the languages and a third language. Such a method would be useful for language pairs for which wide-coverage bilingual dictionaries are not available, but it suffers from spurious translations caused by the ambiguity of intermediary third-language words. To eliminate spu...
Automatic Morphological Parsing of Chinese
This paper provides a basic design of an automatic morphological parser of Chinese that uses the syntactic word definition for word segmentation and tries to manage with as few resources as possible. Two possible resource bases are suggested, a dictionary of characters of Chinese with their default parts-of-speech or a small dictionary with some common words and their parts-of-speech to be u...
Publication date: 1995